DOCUMENT RESUME



ED 081 785 



TN 003 1«5 



AUTHOR        Diederich, Paul B.
TITLE         Short-cut Statistics for Teacher-Made Tests.
INSTITUTION   Educational Testing Service, Princeton, N.J.
PUB DATE      73
NOTE          12p.

EDRS PRICE    MF-$0.65 HC-$3.29
DESCRIPTORS   *Correlation; Guides; *Item Analysis; *Standard Error
              of Measurement; Statistics; Teacher Developed
              Materials; *Test Reliability; *Test Results



ABSTRACT 



Written by an ex-Latin teacher, this guide provides short-cuts for
analyzing test results for the non-mathematical teacher.
Discussions are given of item analysis (item analysis by a show of
hands, standards for test items: success, standards for test items:
discrimination, and the second stage of item analysis). The standard
error is then presented (the standard error of a test score,
estimated standard errors of test scores, when two test scores are
"really" different, levels of significance, philosophic digression,
the standard error of an average, standard error of a difference
between averages). Test reliability computation is then described,
and the way to compute a simple type of correlation for a class of
average size is presented. (DB)






SHORT-CUT STATISTICS FOR TEACHER-MADE TESTS 



by Paul B. Diederich



U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION

THIS DOCUMENT HAS BEEN REPRODUCED EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING IT. POINTS OF VIEW OR
OPINIONS STATED DO NOT NECESSARILY REPRESENT OFFICIAL NATIONAL
INSTITUTE OF EDUCATION POSITION OR POLICY.



EDUCATIONAL TESTING SERVICE, PRINCETON, NEW JERSEY






Copyright © 1960, 1964, 1971 by Educational Testing Service. All rights reserved.






For the Nonmathematical Teacher

The writer is an ex-Latin teacher with thirty years of
teaching experience who was attracted to testing by the
fact that so much nonsense is written and spoken about
education. He wanted to find out, at least in his own
classes, what worked and what did not work by means of
tests of his own construction, both essay tests and objective
tests. Since it took him longer than he cared to spend
to analyze his test results by the precise and elegant
methods favored by statisticians, he gradually learned or
developed short-cuts that yielded approximately the same
results.

All of these short-cuts have passed two basic tests. First,
they were all applied to actual data by the writer's son
while he was in the eighth grade, making D's in arithmetic,
and he had no trouble with the mathematics. Second, they
have all been discussed with competent statisticians who
winced slightly but agreed that the methods are valid for
the purposes for which most teachers will use them, and as
precise as the data from classroom tests will ordinarily
warrant.



Item-analysis

Item-analysis by a show of hands. One of the chief advantages
of published tests over teacher-made tests is that the
former are pretested on a large number of students like
those for whom the test is intended, and then the professional
test-maker gets figures on (a) the success of the
group on each item (what percent got it right); (b) the
discriminating power of each item (based on how many
more high-scoring than low-scoring students got it right);
and (c) how many high-scoring and low-scoring students
chose each response to each item. The test-maker then
discards items that are too hard, too easy, or non-discriminating,
or else touches up items by revising some responses
or substituting others. Usually at least half of the items
that are pretested in this way are either discarded or revised,
and the final form of the test contains only items
that are likely to work well.

Teachers cannot pretest items for important tests on the
same group that is to take the final forms of these tests,
for that would show them what questions were going to be
asked, and students would bone up on them. However, if
teachers item-analyze each important test after it is given,
they can gradually build up a file of test items that have
worked well in the past or have been revised to eliminate
faults that appeared in earlier forms. This file will both
reduce the work of constructing tests and improve the
tests. If the file is large (as it very soon will be), students
seldom learn what questions to expect. Examiners report
very little tendency for old items to get "easier" as the
years roll on.






Unfortunately, the only way of making an item-analysis
that is explained in the books on tests and measurements
is so laborious and time-consuming that no teacher who
tried it once is ever likely to try it again. It consists
of preparing a form and then putting down a tally for
each student's response to every question; in other words,
copying all answers to all questions. If there are 40 questions
in the test and 40 students, that means putting down
1,600 tallies. If one is careful, it also means checking
every tally, since nothing is easier to misplace than a tally.
If one skips an item, for example, all of the tallies down to
the point at which one discovers the error will record the
student's answers to the wrong questions. Hence there will
be at least 3,200 operations to perform, not counting the
correction of errors, for each forty-minute test in one class.
It is not surprising, therefore, that item-analysis is almost
never applied to teacher-made tests, even though it is the
basic operation that all published tests have to undergo
and the basic reason for whatever superiority they possess.

Yet all of this work can be done by a show of hands in
class in so little time that students do not resent it. It adds
greatly to their understanding of the test and is a better
basis for class discussion of items that gave trouble than
having students suggest items to discuss. The bright students
are naturally the first to respond, and they tend to
suggest items that present subtle problems of interpretation.
One may never get to the items that reveal the basic
weaknesses of the class.

For routine tests, the teacher may call out the numbers
of the items one by one. Each student holding a paper that
got that item wrong holds up his hand. The teacher counts
and announces the number of hands that he sees for each
item, and writes that number opposite the item on his own
copy of the test, encircling items that call for discussion. It
goes like this:

"Item 1. How many of you are holding a paper that got
item 1 wrong? Hold up your hands. I see three hands. Anyone
else? Let me repeat my question to make sure that you
have this straight. Look at item 1. Is it marked right or
wrong? If it is marked wrong, hold up your hands. I now
see four hands. Larry, what was the trouble? You thought
I meant right? No, that is the other kind of item-analysis;
here I just want to find out which items gave us the most
trouble. Now go on to item 2. Hands? I see two hands. Item
3? No hands. Did nobody get it wrong? Very good. Item 4.
I see fourteen hands; we'll have to discuss that one. Item 5,
zero. Item 6, two."

And so on. Remember that the teacher records the number
of errors opposite each item on his own copy of the
test, and encircles questions that enough students missed
to warrant discussion.

For more important tests, the teacher may want the
"high-low" type of item-analysis that will also reveal the
discriminating power of each item, as shown by the fact
that more high-scoring than low-scoring students got it
right. Professionals use the top 27% in total scores on the
test as the "high" group, the bottom 27% as the "low." If
the teacher uses these proportions, he can use the Item
Analysis Table, prepared by Chung-Teh Fan in 1952
(available through Office of Information Services, Educational
Testing Service, at two dollars per copy) to look up
all the item-statistics discussed in the following section.

The writer, who has been conducting these item-counts
by a show of hands for years in his own classes, prefers
using top and bottom halves for the reason that otherwise
the whole middle half of the class has nothing to do during
the item-analysis and feels left out and gets into mischief.
One must expect smaller differences in percent correct than
one would get between the top and bottom 27%, but it is
still quite clear how much of a difference is desirable. It
ought to be at least 10% of the class. In a class of 40 students,
at least four more students in the top half than in
the bottom half should get an item right.

This figure was not chosen at random or by rule-of-thumb.
Here we must get just a bit technical for a moment,
because part of the fun is the pure swank of knowing
what the experts are talking about, and knowing that one
has comparable figures for one's own tests. The index of
discrimination that they use is called the "biserial correlation
with total test." It is a decimal that shows to what
extent success on the item is related to success on the test
as a whole. Putting it another way, it tells the extent to
which people who did well on the whole test did better on
this particular item than people who did poorly on the
whole test. The professionals like to have their average
biserial above .4 and are quite proud of themselves if it hits
.5 or above. They look hard at items with biserials below
.3 and either touch them up or get rid of them unless they
can prove on other grounds that the item is a good item
that is not closely related to the rest of the test.

Now, it just happens that, for items in the middle range
of difficulty (that 25% to 75% of the students answered
correctly), the biserial correlation with total test is approximately
equal to three times the high-low difference,
expressed as a percent of the class. This is true when the
high-low difference is based on high-low halves of the class,
not otherwise. If the high-low difference is four, and
this is 10% of the class (of 40 students), the biserial correlation
of this item with the total test will be approximately
.30. If it is six, or 15% of the class, the biserial will
be approximately .45. This approximation does not get
seriously wrong until one reaches items that more than
80% or fewer than 20% of the class answered correctly.
For these extremely easy or extremely difficult items, it is
usually a serious under-estimate of the true biserial. One
consolation is that, while such items may be highly discriminating,
they discriminate for a very small fraction of
the group. Still, one occasionally wants a very easy or a
very hard item. In such cases a high-low difference of even
5% of the class may be quite acceptable, and certainly
anything higher is hard to get, but the difference between
high-low halves is not a good index at these extremes.
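The rule of thumb in the paragraph above can be tried out in a few lines of Python. This is only a sketch: the function names are invented for illustration, and the check for the trustworthy range simply restates the 20% to 80% limits mentioned in the text.

```python
def approx_biserial(high_right, low_right, class_size):
    """Three times the high-low difference, expressed as a fraction of the class.

    Only meaningful when the highs and lows are the top and bottom
    halves of the class, as the text stipulates.
    """
    if class_size <= 0:
        raise ValueError("class_size must be positive")
    return 3 * (high_right - low_right) / class_size


def in_trusted_range(high_right, low_right, class_size):
    """True when 20% to 80% of the class answered the item correctly.

    Integer arithmetic avoids floating-point surprises at the limits.
    """
    total_right = high_right + low_right
    return 2 * class_size <= 10 * total_right <= 8 * class_size


# The two cases from the text: differences of 4 and 6 in a class of 40.
print(round(approx_biserial(14, 10, 40), 2))
print(round(approx_biserial(16, 10, 40), 2))
```

Outside the trusted range the text warns that the figure usually understates the true biserial, so a caller would do well to consult `in_trusted_range` before relying on the estimate.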

Do not fear that you will have to compute these percents
for every item. When you begin each item-analysis, divide
the number of students who are present by 10 and round
to the nearest whole number. If 38 are present, the minimum
acceptable high-low difference will be 4. If an item
exceeds this number, its discrimination is satisfactory; if
not, you will have to look at it to see whether anything is
wrong.
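For anyone keeping records on a computer, the same threshold might be sketched as follows. The function names are hypothetical, and one assumption is made: an item whose difference exactly equals the minimum is treated as acceptable, since the text calls that figure the "minimum acceptable" difference.

```python
def minimum_difference(students_present):
    """Number present divided by 10, rounded to the nearest whole number.

    Integer arithmetic rounds halves upward; Python's built-in round()
    would round halves to the nearest even number instead.
    """
    return (students_present + 5) // 10


def discrimination_ok(high_right, low_right, students_present):
    """Assumption: meeting the minimum exactly counts as satisfactory."""
    return (high_right - low_right) >= minimum_difference(students_present)


print(minimum_difference(38))         # the example in the text: 38 present
print(discrimination_ok(14, 8, 38))   # a high-low difference of 6
```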



After you or the students have finished scoring the test,
arrange the papers in descending order of total scores and
count down to the middle score. Suppose this score is 21,
and five students made it. All papers above this score
obviously go into the high group; those below go into the
low. But what about the five middle papers? Put them at
random into the high and low piles until the numbers in
each pile are equal. If you have an odd number of students,
hold out one middle paper and do not count it in the item-analysis.
The student who does not get a paper will be the
score-keeper and will write the figures for each item on the
blackboard; otherwise the teacher will do it.
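The sorting procedure just described could be sketched in Python roughly as follows. The function name and the dictionary of scores are invented for the example. Shuffling before a stable sort is one way, not necessarily the only way, of putting tied middle papers into the two piles at random.

```python
import random


def split_high_low(scores, rng=None):
    """Split papers into equal high and low piles.

    scores: dict mapping a student's name to a total score.
    Returns (high, low, held_out); held_out is the one middle paper
    set aside when the number of students is odd, otherwise None.
    """
    if rng is None:
        rng = random.Random()
    names = list(scores)
    rng.shuffle(names)  # random order among papers with tied scores
    # sort() is stable, so tied papers keep their shuffled order
    names.sort(key=lambda name: scores[name], reverse=True)
    held_out = None
    if len(names) % 2 == 1:
        held_out = names.pop(len(names) // 2)  # the middle paper sits out
    half = len(names) // 2
    return names[:half], names[half:], held_out


# Hypothetical class of seven: four papers are tied at the middle score.
papers = {"Ann": 30, "Bob": 27, "Cal": 21, "Dee": 21,
          "Eve": 21, "Fay": 21, "Gus": 12}
high, low, held_out = split_high_low(papers, random.Random(0))
print(len(high), len(low), held_out is not None)
```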

Now it is necessary to have a clear separation between
the "highs" and the "lows" in the classroom. To avoid
shifting the students, those on the right may get the high
papers, those on the left the low; or those in front may
get the highs, those in back the lows. The teacher appoints
a counter for each group to call out the number of hands
raised in his part of the room for each item.

The four figures obtained for each item may be labeled
and defined as follows:

H = the number of highs who got the item right

L = the number of lows who got the item right

H + L = "SUCCESS" (the total number who got the item
        right)

H - L = "DISCRIMINATION" or "the high-low difference"
        (how many more highs than lows got the item right)

The teacher calls out the numbers of the items one by
one: e.g., "Item 1." Everyone whose paper got that item
right holds up his hand. The counter for the highs calls out
the number of upraised hands in his section: e.g., "Fourteen."
Then the counter for the lows calls out the number
of upraised hands in his section: e.g., "Eight." The score-keeper,
be he teacher or student, immediately adds these
two figures and calls out the total: e.g., "Twenty-two." He
then subtracts the lows from the highs (in his head) and
calls out the difference: e.g., "Six." Everyone copies these
four figures at the bottom of item 1 on the copy of the test
that he is holding: 14 8 22 6. There is no need to label
them, since this is a standard sequence, and before long
everyone will know what it means. The rhythm of the
operation is approximately as follows: Item 1. Hands.
Pause for counting. 14. 8. 22. 6. Item 2. . . . If the teacher
or a student wants to call for any of these figures again,
the proper short form of the question is, "What was the
high? the low? the total? the difference?"
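The four figures called out for each item (high, low, total, difference) are easy to reproduce from marked papers. In this minimal sketch each paper is represented, purely as an assumption for illustration, by the set of item numbers it got right; the function name is likewise invented.

```python
def item_figures(high_papers, low_papers, item):
    """Return (H, L, H + L, H - L) for one item.

    Each paper is a set holding the numbers of the items answered
    correctly on that paper.
    """
    h = sum(1 for paper in high_papers if item in paper)
    l = sum(1 for paper in low_papers if item in paper)
    return h, l, h + l, h - l


# Reproduce the example in the text: 14 highs and 8 lows got item 1 right.
high_papers = [{1}] * 14 + [set()] * 6
low_papers = [{1}] * 8 + [set()] * 12
print(item_figures(high_papers, low_papers, 1))
```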

After a little practice, the complete item-analysis for a
one-period test will take between ten and twe...,
depending on the number of items. It would take the
teacher at least two hours to do it at home, and he would
make far more mistakes than will be made in class, where
every alert student will be only too happy to pounce on
any mistake in counting, adding, or subtracting. Teachers
in the writer's measurement classes have conducted such
item-analyses as far down as the fourth grade, and have
reported that the students had no trouble understanding
the procedure or carrying it out. At the other end of the
scale, even students in graduate courses do not resent it.
It gives them visual, auditory, and tactile clues to the success
of the class on each item, and it shows them graphically
and convincingly which items separated the sheep
from the goats. They get personally involved in finding out
how well the class did on the test and why they went
wrong on the items that gave trouble. By contrast, if the
teacher does all the work for them at home and hands
them the results of his analysis on a platter, no one will
understand and no one will be interested. They have to get
into the act if the analysis of a test is to be a moving and
enlightening experience.

Standards for test items: success. It is a common belief
that most tests should start with very easy items, gradually
get harder, and end with very hard items. If this sequence
is hard to arrange, at least the test should cover a wide
range of item-difficulties. While many professionals share
this view, it is worth knowing that practically every serious
investigation of this problem since 1932 has come up with
the opposite conclusion: that precision of measurement is
greatest when all of the items in a test are about equally
difficult for the group tested; that maximum reliability and
dispersion of scores will be attained if every item in the
usual sort of multiple-choice test is answered correctly by
somewhere between 60% and 70% of the students tested.
We do not want to insist on this point, since the advantage
of a narrow range of item-difficulties is very small in relation
to other sources of validity and reliability, and since it
is usually almost impossible to achieve a narrow range of
item-difficulties. Still, teachers should know that if they
sweat hard in order to achieve a nice progression from
easy to difficult, their effort has probably been wasted, and
its most probable effect will be precisely the contrary of
what they expect. They expect it to yield a wider spread
of scores. What it actually yields is a narrower spread of
scores than if all the items were of approximately equal
difficulty. Hence items that more than 90% got right should
be questioned as too easy, and items that fewer than 30%
got right as too hard for inclusion in a test. Questioned,
mind you, not rejected, for they may be justified on other
grounds.

Standards for test items: discrimination. It has already
been indicated that the minimum acceptable high-low difference
by professional standards is 10% of the class, and why
this is so, except in very easy and very hard items. The
"standard error" of this sort of high-low difference, however,
is so large that at least a fifth of the items that turn
out to be quite discriminating after repeated use may fall
below this standard in any one administration of the test
by pure chance. Hence we should be wary of rejecting an
item if it falls below the suggested minimum the first time
it is tried, if, after due consideration, we can find nothing
wrong with the item. It is quite strict enough to say that
not more than a fifth of the items in the final test should
fall below this standard, and the average high-low difference
should be above 10% of the class, preferably 15%
or above. High discrimination spreads out the scores as
widely as possible and hence increases the reliability of the
test.
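The standards for success and discrimination given above lend themselves to a small screening routine. The sketch below is an illustration only: the function name and the wording of the flags are invented, and in keeping with the text an item is flagged for a second look, never rejected automatically.

```python
def review_item(high_right, low_right, class_size):
    """Return a list of reasons to question an item (empty if none).

    Integer arithmetic is used so that an item sitting exactly on a
    limit is not flagged through floating-point rounding.
    """
    flags = []
    total_right = high_right + low_right
    difference = high_right - low_right
    if 10 * total_right > 9 * class_size:   # more than 90% got it right
        flags.append("too easy?")
    if 10 * total_right < 3 * class_size:   # fewer than 30% got it right
        flags.append("too hard?")
    if 10 * difference < class_size:        # difference below 10% of class
        flags.append("low discrimination?")
    return flags


# Hypothetical figures (H, L) for four items in a class of 40.
for h, l in [(14, 8), (20, 19), (6, 4), (12, 8)]:
    print(h, l, review_item(h, l, 40))
```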

A teacher who uses this method of item-analysis will soon
find out that high-low differences for some of his items will
be zero or negative: that is, the same number of students in
the top and bottom halves may get them right, or more
low-scoring than high-scoring students may pick the keyed
answer. One of the chief uses of item-analysis is to direct
attention to such items. While this sort of thing can happen
by pure chance, a closer look at the item will often reveal
why the better students shied away from the intended
answer. One can touch up the ambiguity or inaccuracy and
thereby save not only the item but the resentment of future
students who would be bright enough to detect the error.

All discrimination figures look wonderful toward the end
of a test that only the high-scoring students were able to
finish. For example, it may appear that almost all of the
high-scoring students and none of the low-scoring students
answered the last item correctly, which would be ideal if
it were not spurious. All the low-scoring students might
have known the answer but simply did not reach the item.
After a fifth of the students have dropped out, item-analysis
figures are so misleading that it is well not to continue
the analysis beyond this point.

Second stage of item-analysis. There may be a few items
in a test that turned out to be too easy, too hard, or did
not discriminate satisfactorily for no apparent reason, and
class discussion does not reveal anything wrong with them.
If there is time, these may be subjected to a second stage
of item-analysis, which is too laborious and time-consuming
to apply to more than a few items. For these few items, one
asks how many in the high group, and then how many in
the low group (a) omitted the item, and (b) chose each
response. Results like the following may indicate what is
wrong:



                      Responses

          Omit     1     2     3     4     5

High        0     11     9     0     0     0
                  --
Low         0     14     4     2     0     0



The right answer, response 1, is indicated by a line between
the highs and lows who chose it. Three more lows than
highs chose it; hence its index of discrimination is -3.
Why? The figures for response 2 suggest an answer. This
response was too attractive to the high-scoring students.
Perhaps they thought response 1 was too obvious; they
suspected a trap; then they figured out some interpretation
of response 2 that they could defend as the right answer.
If so, discussion should reveal what interpretation they gave
to response 2, and it can be revised in a way that does not
permit the interpretation. At the same time, responses 4 and
5 might be made a shade more plausible, but still definitely
wrong, because in their present form they were wasted;
nobody chose them. Incidentally, item-analysis has probably
been a factor in reducing the five-choice item, which
was standard a generation ago, to the four-choice item
which is more popular today except in a few item-types
(such as spelling) in which the fifth response is usually
"none of these." Item-writers were not very successful in
framing five responses that were all sufficiently plausible
to "draw blood."
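The reading of the response table given above can be mimicked with a short routine. This is a sketch under stated assumptions: the function name is invented, and the count of highs choosing response 1 is taken as 11, the value that agrees with the stated index of -3.

```python
def second_stage(high, low, key):
    """Summarize a response-by-response count for one item.

    high, low: dicts mapping each response (or "omit") to the number
    of students in that group who chose it. key: the keyed answer.
    Returns (discrimination, suspicious, wasted): wrong responses that
    drew more highs than lows, and wrong responses nobody chose.
    """
    wrong = [r for r in high if r != "omit" and r != key]
    discrimination = high[key] - low[key]
    suspicious = [r for r in wrong if high[r] > low[r]]
    wasted = [r for r in wrong if high[r] + low[r] == 0]
    return discrimination, suspicious, wasted


# Counts from the example table (the 11 is an assumed reading).
high = {"omit": 0, 1: 11, 2: 9, 3: 0, 4: 0, 5: 0}
low = {"omit": 0, 1: 14, 2: 4, 3: 2, 4: 0, 5: 0}
print(second_stage(high, low, key=1))
```

The routine only points at responses worth discussing; deciding what interpretation the high-scoring students gave to response 2 still takes a conversation with the class.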


The Standard Error

The standard error of a test score. Since we have already
introduced the concept of "standard error" in connection
with high-low differences, this may be a good time to
extend the concept to test scores. The first thing to be
said about it is that the standard error is not computed
in the same way in these two cases and is not of anything
like the same magnitude. If you look in the index of a textbook
of elementary statistics, you will find at least fifteen
different kinds of standard errors: of scores, averages,
differences, correlations, proportions, etc. They are all computed
differently and yield figures of different orders of
magnitude. The standard error of an average, for example,
is usually much smaller than the standard error of a single
score, while the standard error of the difference between
the two scores is larger than the standard error of either
score. They all have this basic meaning in common, however.
Suppose you repeated a certain measurement operation
a hundred times and kept averaging the results until
no further repetitions would change that average one iota.
You may think of that final average as the "true" measure,
no matter whether it is a score on spelling, the average of
a class, the difference between two classes, the correlation
between spelling and verbal intelligence, or whatnot. You
might then mark off the points that would enclose the
middle two-thirds of the figures you got on the various
trials on your way to that final average. You would call
these points one standard error above the true measure
and one standard error below it. You might then go on to
mark the points that would enclose the middle 95% of all
the figures you got on the various trials. You would call
these points two standard errors above the true measure
and two standard errors below. There would still be 5%
of extremely deviant figures beyond these two points, but
the limits of two standard errors would enclose most of the
figures that you would get.

The trouble with applying this concept to testing is that
we are never sure what the "true" measure is, since we do
not have time in schools to measure the same attribute a
hundred times, and if we did, we would change it beyond
recognition. But statistical theory permits us to compute
the standard error of most measurement operations on the
first trial, and then we can say that the chances are two out
of three that the obtained figure lies within one standard
error of the true figure, and 95 out of 100 that it lies within
two standard errors.

The next thing to be said about the standard error is
that it is not the same as the "probable error" that was
popular a generation ago, but it is based on the same idea
of the limits within which measures may vary by pure
chance, and either figure may be translated into the other.
The chief reason why the "probable error" is no longer
used is that there is no way to compute it directly: one
first has to compute the standard error and then take
approximately two-thirds of it to get the probable error.
The only point in doing so was that the early statisticians
thought it would be easier for the hayseeds to grasp the
idea that the chances were fifty-fifty that the obtained
figure would lie within one "probable error" of the true
figure, rather than that the chances were two to one that
it would lie within one "standard error." On mature reflection,
however, it seemed that the first idea was not really
any easier to grasp than the second, and it was rather silly
to keep on performing an extra operation every time one
computed an error of measurement just to make the figure
more appealing to the laity. The name "probable error"
undeniably had more popular appeal, but the appeal was
spurious on two counts. First, this kind of "error" is not
"probable"; it is certain. Second, it gave the idea that someone
may have made that much of a mistake in taking the
measure. If any such mistakes are made, they are not included
within this type of "error." It must be understood
in its root sense of "variation." It assumes that all the
measures have been taken and recorded accurately; even
so, you are not going to get the same figure twice except
by luck. The "error" indicates within what limits the obtained
figures are likely to vary by pure chance.

Not all kinds of chance, however. If a teacher gets angry
at the students who were absent during a crucial examination
and sees to it that the make-up test is harder and
marked more severely, their scores will dip in a way that
could not be predicted mathematically. Mistakes in writing
items, scoring, or marking unintended answers and external
circumstances that may affect scores, such as sickness,
noise, interruptions, hot sticky days, etc., are also beyond
the pale of the standard error. The only kind of variation
in scores that is standard and therefore measurable is
"sampling error." Suppose you want to find out how well
your students can spell. There are at least 600,000 English
words that you might ask them to spell, but let us suppose
that there are only 10,000 that they would ordinarily be
asked to spell by the end of grade 6. If you select 100 of
these words completely at random and get an accurate
score on the number they were able to spell, the score will
give you an estimate of the percentage of the 10,000 words
that they are probably able to spell. But if you take another
100 words from the same pool of words completely at random,
you know that very few students will get exactly
the same score as on the first 100. This variation, due to
the sample that happens to be chosen, is what the standard
error means.
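The spelling illustration can be acted out by simulation. The sketch below is not the writer's method; it simply draws repeated random 100-word tests from a pool of 10,000 words for one imaginary student. The share of words the student knows (70%), the number of repetitions, and all names are assumptions chosen for the example.

```python
import random
import statistics


def simulate_spelling_scores(known_fraction=0.70, pool_size=10_000,
                             test_length=100, trials=1_000, seed=0):
    """Scores of one student on many random tests from the same pool.

    known_fraction is a hypothetical share of the pool that the
    student can spell; every test is a fresh random sample of words.
    """
    rng = random.Random(seed)
    known = round(known_fraction * pool_size)
    pool = [True] * known + [False] * (pool_size - known)
    return [sum(rng.sample(pool, test_length)) for _ in range(trials)]


scores = simulate_spelling_scores()
center = statistics.mean(scores)   # stands in for the "true" score
spread = statistics.pstdev(scores)  # plays the part of the standard error
within_one = sum(abs(s - center) <= spread for s in scores) / len(scores)
within_two = sum(abs(s - center) <= 2 * spread for s in scores) / len(scores)
print(f"average {center:.1f}, spread {spread:.1f}")
print(f"within one spread: {within_one:.0%}; within two: {within_two:.0%}")
```

Although the student's knowledge never changes, the scores differ from test to test, and roughly two-thirds of them fall within one spread of the average. That variation, arising only from which words happen to be drawn, is the sampling error the text describes.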

The variation will be much larger if two different
teachers independently try to find out how well the same
class appreciates Hamlet. Here the number of valid questions
that they might ask is theoretically infinite, but each
has time to ask only 40 questions. If we can regard each
set of questions as a random sample drawn from an infinite
pool of items testing the same ability, the variation in
scores from one such sample to another is the sort of thing
that is measured by the standard error. In practice, the
variation will be much greater, since the teacher's bias will
affect his selection of questions: one may be a bear on
character development, the other on figures of speech. They
are not measuring the same attribute at all, even though
both call it "appreciation of Hamlet."

For these reasons, the standard error accounts for only a
small part of the variation in scores that may be expected
in practice, but it is quite large enough to make us want to
get several independent scores before we make up our
minds as to the degree of success of our students in attaining
the objectives of the course. The standard error tells
within what limits scores may be expected to vary by pure
chance in the selection of items. If we add to that our own
bias in the selection of items, the stupid mistakes we make
in writing the items and in scoring them, and external
circumstances that may affect the ability of the students to
answer the questions, it is obvious that the variations we
may expect between two independent measures of an
ability that we refer to by a single name may be quite
large. It is not so large, however, that we should despair of
ever being able to find out which of our students have been
more successful than others in attaining the objectives of
the course. Since we usually have them for a full year, we
need never rely on a single measure but can give them a
long series of measures. Any one measure is like any one
baseball game, in which the team that is in the cellar may
clobber the team at the top. But over the whole season, the
team that is really superior will rise to the top, and the
team that is really inferior will fall to the bottom.



Estimated Standard Errors of Test Scores

Number       Standard    Exceptions: regardless of the length
of Items     Error       of test, the standard error is:

< 24            2        0 when the score is zero or perfect
24-47           3        1 when 1 or 2 points from 0 or from 100%
48-89           4        2 when 3 to 7 points from 0 or from 100%
90-109          5        3 when 8 to 15 points from 0 or from 100%
110-129         6
130-150         7


This table may be interpreted as follows: In an objective
test of 50 items, two scores out of three will lie within 4
raw-score points (one standard error) of the "true score"
these students would attain if you continued testing with
repeated random samples from the universe of items testing
the same ability, and 95% of the scores will lie within 8
raw-score points (two standard errors) of their "true scores."
The relatively few scores at the extremes will have slightly
smaller standard errors, as indicated under "Exceptions,"
but there are usually not enough of these to justify separate
treatment.

If your local Director of Research casts aspersions on
this table, ask him to read two articles by Frederic M.
Lord, "Do Tests of the Same Length Have the Same
Standard Errors of Measurement?" and "Tests of the Same
Length Do Have the Same Standard Error of Measurement,"
in Educational and Psychological Measurement,
XVII, 4 (Winter, 1957): 510-521; and XIX, 2 (Summer,
1959): 233-239.

When are two test scores "really" different? Cooperative
Tests and Services, Educational Testing Service, has been
the first major test publisher to enforce attention to the
standard error of test scores by reporting scores on its new
SCAT and STEP tests as bands rather than as points.
Each "band" extends from one standard error below the
obtained score to one standard error above, and it is
explained that the chances are two out of three that the
"true" score lies somewhere within this band. Teachers are
urged not to regard two scores as "really" different unless
the two bands do not overlap: i.e., unless the two scores are
at least two standard errors apart.
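The band rule can be sketched in a few lines. The function names are hypothetical, and the sketch assumes both scores come from the same test and so share one standard error.

```python
def band(score: float, standard_error: float) -> tuple:
    """Band from one standard error below to one above the score."""
    return (score - standard_error, score + standard_error)


def bands_overlap(score_a: float, score_b: float, standard_error: float) -> bool:
    """True when the two bands overlap, i.e. when the scores are less
    than two standard errors apart. Bands that merely touch do not
    count as overlapping."""
    return abs(score_a - score_b) < 2 * standard_error


print(band(25, 3))               # (22, 28)
print(bands_overlap(20, 25, 3))  # True: only 5 points apart
print(bands_overlap(20, 26, 3))  # False: the bands just touch
```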

While this is a great improvement over previous practice
in interpreting differences between scores, a teacher who
has managed to read this far without losing his grip may
want to carry this line of thinking a step further in order
to get hold of the concept of "the standard error of a
difference." It was indicated in passing on page 4 that the
standard error of a difference between two scores is larger
than the standard error of either score. Think of the
difference as a rope tied between two stakes, which are the
two scores. Since there is wobble in both stakes, there is
bound to be more wobble in the rope than there is in either
stake. To get the standard error of the difference between
two scores, square the standard error of each score, add the
two squares, and take the square root. For example, it was
shown above that the standard error of a test of 24-47
items is 3 (rounded to the nearest whole number). Three
squared is nine, the square of the standard error of each
score. Nine plus nine is eighteen, the sum of the squares of
the standard errors of two such scores. The square root of
18 is approximately 4¼. This is the standard error of the
difference between the two scores. You can see at once that
it is appreciably larger than the standard error of either
score, which is 3.
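The square-add-root recipe is short enough to check by machine. This is a minimal sketch with a made-up function name; it assumes, as the text does, that the errors in the two scores are independent.

```python
import math


def standard_error_of_difference(se_a: float, se_b: float) -> float:
    """Square each standard error, add the squares, take the root."""
    return math.sqrt(se_a ** 2 + se_b ** 2)


se_diff = standard_error_of_difference(3, 3)
print(round(se_diff, 2))       # 4.24, a shade under 4 1/4
print(round(2 * se_diff, 1))   # 8.5, twice the standard error of the difference
```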



Now, if you want to be 95% sure that the two scores
represent a true difference in ability, the difference between
them ought to be twice the standard error of the difference,
not twice the standard error of either score. In other
words, the two scores should lie at least 8½ points apart,
not just 6 points apart as the Cooperative Test
recommendation implies. The Cooperative people are well
aware of this point but do not use it in reporting scores
because (1) it would be too complicated for teachers to
square, add, and take a square root before comparing any two
scores; (2) if two bands do not overlap, they usually do not
touch, and the distance between them is likely to reach
statistical "significance"; (3) even when they do touch, the
difference between the two scores is "significant" at about
the 15% level, which is good enough for most classroom
purposes.

Levels of significance. When people report "findings"
rather than "opinions," it is common practice for them to
tag each "finding" as

** (significant at the 1% level);
*  (significant at the 5% level);
NS (not significant).

The last is professional shorthand for "not significant
even at the 5% level." Thus, the difference between two
Cooperative Test scores whose bands touched but did not
overlap would be reported as "not significant," because it
is significant only at the 15% level. That is, out of every
100 differences of exactly this size, 15 might be due to pure
chance in the selection of items for the test. In any one of
these cases, there is no way to tell whether the difference
was "real." One can only report, after computing the
"wobble" in the measure, that there are 15 chances in a
hundred that it might have been a fluke. That is commonly
regarded as "not significant."
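Where the "15 chances in a hundred" comes from can be sketched as follows. The sketch assumes that errors of measurement follow the normal (bell-shaped) curve, which is the usual assumption behind these levels; the function name is invented for illustration.

```python
import math


def chance_of_fluke(difference: float, se_of_difference: float) -> float:
    """Two-sided probability that a difference at least this large
    would turn up by chance alone, assuming normally distributed
    errors of measurement."""
    ratio = abs(difference) / se_of_difference
    return math.erfc(ratio / math.sqrt(2))


# Two scores whose bands just touch: 6 points apart, each with a
# standard error of 3, so the standard error of the difference is
# the square root of 18.
touching = chance_of_fluke(6, math.sqrt(18))
print(round(touching, 3))                  # 0.157, about the 15% level

# A difference of exactly twice its own standard error:
print(round(chance_of_fluke(2, 1), 3))     # 0.046, just inside the 5% level

# A difference of 2.6 times its own standard error:
print(round(chance_of_fluke(2.6, 1), 3))   # 0.009, just inside the 1% level
```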

It is obvious from this that a statistician is a man who, if
he remains true to his principles, would never bet on
horse-races. He is willing to say that a difference is "real"
(i.e., not a chance difference) only if there are less than
five chances in a hundred that the obtained difference could
have come about by accident of sampling. Even this is
considered rather a grave risk, and he is really happy only
when there is less than one chance in a hundred that the
difference was a fluke. Since he also has a knack for
inventing names that mean the opposite of what the layman
would think he meant, he calls these two points "the 5%
level" and "the 1% level." These sound as though the second
was less significant than the first, but the opposite is
true. The first means that there are less than five chances
in a hundred that the difference is a fluke; the second, that
there is less than one chance in a hundred. Although he
would shudder at the loose language, surely we are justified
as laymen in thinking of the first as "95% sure" and the
second as "99% sure" that the difference is "real." We
ought, however, to be sure-footed in our definitions of these
looser terms. "Real," for example, here means only
"non-chance." It does not necessarily mean "true," for if an
experiment was set up by a very biased person, it might
yield results that were the opposite of the truth (as it
ultimately emerges from the consensus of later investigators).
It would still be proper to say that the results obtained by
the first investigator did not arise by chance, that is, by
accident of sampling. They arose from bias.

Since bias, stupidity, and carelessness seem far more
likely to the layman to vitiate the results of experiments
than pure chance, he wonders whether it is worth while to
discount the effect of chance alone. The answer seems to
be that it is worth while, chiefly because almost all
educational measurements contain so large an element of pure
chance that many score differences can be attributed to
accidents of sampling. The critic can go on to consider
whether the remaining differences are true and important,
or simply the logical result of the stupid and biased way in
which the experiment was conducted.

But how does one establish these two levels of
"significance"? First, a difference is significant at the 5%
level if the difference is twice as large as its own standard
error (not the standard error of the two scores, but the
standard error of the difference). It is significant at the
1% level if the difference is 2.6 times as large as its own
standard error. You divide the difference by its own standard
error, and if the quotient is between 2 and 2.6, you are in
the clear; if it is 2.6 or more, you are on velvet, or, as the
statistician would say, "not in the chance domain." There
is, of course, no reason to set any particular limit as the
boundary between reality and chance, but the 5% and 1%
levels of significance are most commonly reported for the
sake of simplicity. There are many other "tests of
significance," but this one is probably the most widely used
in educational research, and sufficiently representative to
give you the basic idea.
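The divide-and-compare rule just described might be written out like this. The function name and the wording of the labels are illustrative only; the cut-offs of 2 and 2.6 are the ones given in the text.

```python
def significance_label(difference: float, se_of_difference: float) -> str:
    """Divide the difference by its own standard error and compare
    the quotient with the cut-offs of 2 and 2.6."""
    quotient = abs(difference) / se_of_difference
    if quotient >= 2.6:
        return "significant at the 1% level"
    if quotient >= 2:
        return "significant at the 5% level"
    return "not significant"


# Two scores on a test whose standard error of a difference is 4.25:
print(significance_label(6, 4.25))     # not significant
print(significance_label(8.5, 4.25))   # significant at the 5% level
print(significance_label(12, 4.25))    # significant at the 1% level
```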

Philosophic digression. Since it is as hard for the writer as
for an equally non-mathematical reader to keep his mind
on the mathematics of the testing situation, perhaps we
both may be forgiven for pausing a moment to cackle over
the rather odd definition of reality that has come to be
accepted as a rule of the game by people who are searching
for reality in the supremely important area of the growth
of the mind. Such people may be visualized as primitive
parents who are standing the minds of their children up
against the back door and measuring the aspects of those
minds that they know how to measure at all with a
foot-rule that stretches or contracts every time it is used.
All that they feel safe in saying about their measures is
that two-thirds of the time they come within an inch of the
true figure, but five percent of the time they are more than
two inches off. Therefore, before they say that the mind of
Susie has grown up more than the mind of Joe toward such
a goal as the appreciation of Hamlet, they ask that the
difference between them be at least twice the amount that
the ruler will stretch (or contract) in measuring such
differences, and preferably 2.6 times that amount. Since the
standard error of any one measurement with this ruler is
one inch, its standard error in measuring a difference will
be how much?

Square the standard error of Susie's measurement: 1² = 1
Square the standard error of Joe's measurement: 1² = 1
Add the two squares: 2
Take the square root: 1.4

Thus the standard error of our ruler in measuring a
difference is 1.4 inches. (If you do not know how to extract
square roots, any math teacher can give you a table of
squares and square roots of numbers between 1 and 1,000.)
Then, by the rules of the Ancient and Honorable Order of
Measurers, we are allowed to certify that Susie is bigger
than Joe in appreciating Hamlet only if she is at least 2.8
inches bigger on our fallible foot-rule (twice the standard
error of our instrument in measuring differences). If other
members of the tribe want to know how certain that verdict
is, we can tell them that, if there were no true difference,
an apparent difference as large as this would turn up less
than five times in a hundred measurements of the same
kind. If they have an immense prize of a ton of gold for
the best appreciator of Hamlet (surely a wise investment
for any community) and want to be surer than that, we can
insist that Susie be at least 3.6 inches bigger on this wobbly
instrument (2.6 times the standard error of the instrument
in measuring differences). Then we can certify that the
chances are less than one in a hundred that we would get
a difference as large as this if there were no true difference.

Obviously there will be a great clamor among the more
ignorant members of the tribe that this is no way to go
about it; the thing to do is to buy a steel foot-rule that will
not stretch or squeeze on every measurement and that will
yield absolutely exact results. Alas, there are no such
instruments for measuring the growth of the mind, and we
have to put up with those we have. Of course, there will be
members of the tribe who will insist that they can ask Susie
and Joe five questions about Hamlet and tell you for sure
which one appreciates it best, but such people will be found
to differ far more widely in their verdicts than will the
measurers.

"All exact science," says Bertrand Russell in The
Scientific Outlook, "is dominated by the idea of approximation.
When a man tells you that he knows the exact truth about
anything, you are safe in inferring that he is an inexact
man."

Most of philosophy, as well, has been concerned in one
way or another with the problem of distinguishing
appearance from reality. Like the poor educator who gets fed
up with the vast amount of nonsense that is talked and
written about education, and who turns to testing to find
something that is real as a basis for his deductions, the
philosophers have been busy since the beginning of time with
the problem of separating truth from opinion, warranted
assertibility from mere assertion. While they have done a
great deal to clarify the problem, there are not too many
instances in which they have come up with widely understood
and accepted rules to guide the seeker of reality. Among
these are the rules of logic and the canons of scientific
investigation. Far down among the latter is the convention
that a difference may be accepted as real (as caused by
something other than the vagaries of the measuring
instrument) only if it is twice as great as the standard error
of the instrument in measuring differences, and preferably
2.6 times as great. That sort of ground-rule for conducting
an inquiry into the truth about education would have
interested Plato, and he would probably have approved of it,
since he was a good mathematician himself and regarded
mathematics as a basic discipline for anyone seriously
interested in the search for reality.

A few disgraceful members of the teaching profession
may wonder why anyone should have any trouble discovering
what is real about education. What is real about it, they
will tell you, is the sweat, the smell, the noise, the trouble
with discipline, the overcrowded classes, the low pay, and
so on. If anyone professes to find reality in education by
the process of computing standard errors of differences,
they will hoot with derision. We might agree that these are
some of the unpleasant realities in the job of educating as
it is now conducted, but we are not interested in them; we
want to find out what is real in the process of educating:
that is, in assisting the growth of the mind (not just in
general but in specified dimensions, such as in spelling, in
arithmetic, in reading comprehension, and so on up to the
appreciation of Hamlet). If we looked for such growth amid
the noise and smells of the classroom of the naive realist,
we might find none at all. Who, then, is overlooking the
reality: the measurer who does not care about the noise,
or the realist who does not care about education? Both
ignore certain aspects of reality, but the part that the
realist excludes from consideration seems to many
level-headed people far more important.

At another end of the spectrum are some very nice people
who find what is real in education in the light that is
in the eyes of the children, in the lilt of their voices, in the
cute things they say, and in the charm of their artistic
productions. They, also, would deplore the quest for a reality
that is certified by two standard errors. But they would
also have to assent to the proposition that their job is not
limited to keeping students happy and creative; they have
to assist the growth of the mind; and it is their hypothesis
that happiness and creativity assist that growth better than
blood, sweat, and tears. Very well, but that hypothesis
requires evidence. The evidence cannot be that the children
are in fact happy and creative. It must show that they
learn more than when they are unhappy and uncreative.
And to show that they learn more: there you have a
difference, and it is good discipline in thinking about
education to refuse to recognize it as "real" unless it is at
least twice as great as the standard error in measuring such
differences.






The standard error of an average. While the reader may
look upon this heading gloomily as "more of the same," the
|m>per response to jit, if |h oiily_,l!nev. .is I^Hppe vlMagihs^lio 
dawn," or "The Uriite;<l;JSla^ ; / 

He must have wondered how we could ever show that
any distance between two points is real
when the foot-rule we considered for measuring appreciation
of Hamlet was, in common language, "accurate within
one inch," yet the minimum difference that we could certify
as real turned out to be 2.8 inches. Also, the standard error
of most classroom tests is about three raw-score points, yet
the minimum difference between two scores that we could
certify as real (at the 5% level) was 8½ points. At that
rate, all that we could assert about the distribution of
scores on most classroom tests would be that most of the
students in the top quarter of scores on this test were
probably superior to most students in the bottom quarter. We
could make no assertion with confidence about the scores of
the middle half of the class.

All of this is sad but true; there is very little hope of
proving anything in education with single measures. The
real hope lies in repeated measurements: either testing
many students with each single measure, or testing the
same student with many different measures in the course
of the year. The reason is that the standard error of an
average is much smaller than the standard errors of the
scores that enter into it. With each additional case or
measure, the standard error gets smaller, until in practice it
is really not difficult to prove that some things work better
than others, or that some students are superior to others
in any given respect.

The way to compute the standard error of a class average
is to divide the "standard deviation" of the scores by the
square root of the number of students. In averaging many
tests of the same kind for a single student, you divide the
standard deviation of his scores by the square root of the
number of tests. The general statement that takes in both
of these cases is that the standard error of an average is
the standard deviation of the measures divided by the square
root of the number of measures. If the number of measures is
less than thirty, you are supposed to divide by the square
root of one less than the number of measures (√(N − 1)).

Now we have to find out what the "standard deviation"
is and how to compute it. This is more important than you
may think, for practically every other statistic that you
will ever compute has the "standard deviation" somewhere
in its formula. It is like the recipe for "white sauce" in the
cookbooks. You may skip it on the ground that you don't
care for white sauce and want to get on to something more
exotic, but you find that most of the recipes for other
sauces begin, "First make some white sauce. Then. ..."

There is a very simple way to find the standard deviation,
proposed by W. L. Jenkins of Lehigh University, that
will work well enough when you are in a hurry and when
the distribution of scores is approximately "normal," that
is, when it resembles the familiar "bell-shaped curve."
Subtract the sum of the bottom sixth of scores from the sum of
the top sixth and divide by half the number of students
tested:

                     Sum of high sixth − sum of low sixth
Standard deviation = ------------------------------------
                         Half the number of students



Let us try this formula on the following distribution of
scores on a test of 40 items:



;iJ 
30 
29 
28 
27 
25 



24 •^ 

23 3 

22 3 

21 5 

20 3 

19 3 

18 3 



17 
16 
15 
11 
13 
12 
11* 



There are 45 students. A sixth of 45 is 7½ students.
Ordinarily we would say "Forget about the half" or "Take
the next higher number," but here the formula itself is an
approximation; hence the numbers that go into it ought to
be as nearly accurate as we can manage. While there would
be no way to take half of the eighth student from the top,
we can jolly well take half of his score. Hence we add the
first seven scores down from the top and then add half of
the eighth score. The sum of these is 216. Then we add the
seven scores from the bottom plus half the eighth score.
The sum of these is 102. Subtracting 102 from 216 gives us
114.

Now we have to divide 114 by half the number of
students, which is 22.5. In the item-analysis, we left out that
half student, since it would have been impossible to get
half of him to sit with the highs and half with the lows.
Here there is no point in leaving him out, since it is almost
as easy to divide by 22.5 as it is to divide by 22. The
quotient is 5.06, which rounds to 5 as the nearest whole
number, the same as you would get in computing the standard
deviation by orthodox procedures.
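The shortcut, including the trick of taking half of the eighth score, can be sketched as below. The function name is invented, and the 45 scores are made-up illustration data chosen to resemble a class like the one in the example; they are not a transcription of the printed distribution, so the result need not match the figures in the text to the last digit.

```python
def jenkins_standard_deviation(scores: list) -> float:
    """Shortcut estimate: (sum of top sixth - sum of bottom sixth)
    divided by half the number of scores. When a sixth is not a whole
    number of students, the matching fraction of the next score is
    added, as in the worked example."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    sixth = n / 6
    whole = int(sixth)          # whole students in a sixth
    fraction = sixth - whole    # leftover part of a student

    def sum_of_sixth(ordered: list) -> float:
        total = sum(ordered[:whole])
        if fraction > 0:
            total += fraction * ordered[whole]
        return total

    descending = sorted(scores, reverse=True)
    ascending = descending[::-1]
    return (sum_of_sixth(descending) - sum_of_sixth(ascending)) / (n / 2)


# Hypothetical class of 45 students: score -> number of students.
frequencies = {32: 1, 30: 1, 29: 2, 28: 2, 27: 2, 25: 3, 24: 3, 23: 3,
               22: 3, 21: 5, 20: 3, 19: 3, 18: 3, 17: 2, 16: 2, 15: 2,
               14: 2, 13: 1, 12: 1, 11: 1}
scores = [score for score, count in frequencies.items() for _ in range(count)]

print(len(scores))                                  # 45
print(round(jenkins_standard_deviation(scores)))    # 5
```

Because the formula is itself an approximation, rounding the answer to the nearest whole number is usually all the precision it deserves.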

Now, the standard error of the average score on this test
is the standard deviation divided by the square root of the
number of students. Since the number is above thirty, we can
forget about taking one less than the number.
The square root of 45 is approximately 6.7. Don't
bother to compute it; look it up in a table of square roots.
The standard deviation, 5, divided by 6.7 = 50.00 divided by
67 = .75 (rounding to the nearest hundredth).

Now you can see how the standard error of an average
compares with the standard error of the scores that enter
into it. Since this was a test of 40 items, the standard
error of each score was approximately 3 raw-score points.
The standard error of the average of the class now turns
out to be only three-quarters of a point. This means that
the chances are two out of three that the true average of
the class on exactly this sort of test at the present time lies
within .75 points of the average they got on this occasion
(21.06 if you want to figure it out). The chances are 21 to
1 that it lies within 1.5 points: that is, that the true average
lies between 19.56 and 22.56.
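The arithmetic for the class average can be retraced in a few lines. The function names are invented for illustration; the rule of dividing by one less than the number of measures when there are fewer than thirty follows the statement given earlier in the text.

```python
import math


def standard_error_of_average(standard_deviation: float, n_measures: int) -> float:
    """Standard deviation divided by the square root of the number of
    measures, or of one less than that number when it is under thirty."""
    if n_measures < 2:
        raise ValueError("need at least two measures")
    divisor = n_measures if n_measures >= 30 else n_measures - 1
    return standard_deviation / math.sqrt(divisor)


def range_around(average: float, standard_error: float, n_errors: int) -> tuple:
    """Range of n_errors standard errors on either side of an average."""
    return (average - n_errors * standard_error,
            average + n_errors * standard_error)


se = round(standard_error_of_average(5, 45), 2)
low, high = range_around(21.06, se, 2)
print(se)                              # 0.75
print(round(low, 2), round(high, 2))   # 19.56 22.56
```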

This ought to show you why it is still possible to find
things out about education by means of tests even though
the standard error of an individual score is quite large.
Most of the time you are not dealing with individuals but
with classes. You have not taught Hamlet in one way to
Susie and in another way to Joe, but you may well have
taught it in two different ways to two different classes of
approximately equal ability (for example, by using the
admirable Maynard Mack film in one class but not in the
other). The average scores of the two classes on the same
test may very well tell you whether the film made any
average difference. Remember, however, that you must
take the standard error of the difference between the two
averages rather than the standard error of either average.
This is computed exactly as the standard error of a
difference was computed on page 5: square the standard errors
of the two averages, add them, and take the square root.



Standard error of a difference between averages. We should
like to run through this process once more using more
orthodox procedures, since there are many situations in
which the simple Jenkins formula will not work. Chief
among these is the situation in which the distribution of
scores does not look anything like a normal, bell-shaped
curve, as on a mastery test in which most of the scores are
within a few points of a perfect score. Second, it is hard to
apply to letter-grades, where the spread in scores is very
small. Third, it may be difficult to apply, and entail large
random errors, when the number of measures to be averaged
is very small. We shall take up this last case in the
example below, since it will serve to illustrate the standard
procedure with a minimum of numbers.

The problem arose when the writer and his friends were
living in Chicago and had a choice between the Pennsylvania
and the New York Central in getting to New York.
Most of the men preferred the Central on the ground that
it was smoother. Just to be ornery, the writer argued that
they were the victims of propaganda: they had been hearing
the slogan "The Water-Level Route: You Can Sleep"
for so many years that they had come to believe it. The
writer argued that there was no true difference in
bumpiness at all.

Since these were measurement men, they naturally cast
about for some means of measuring bumpiness. One of
them found an empty bottle that had contained Aqua
Velva Shaving Lotion. It was admirable for the purpose,
since it was a square bottle that could be set precisely
in one position on its side, and it had a narrow mouth
through which water would spurt at every bump. They
filled it half full of water, up to the point at which it was
just short of spilling when the bottle was laid on its side.
Then some tidy soul objected that they ought not to let the
water spurt out on the floor of the car, or the porter might
interrupt the experiment. This problem was solved when they
got to Cleveland, which has a toy shop in the terminal. They
bought a toy balloon and slipped it over the mouth of the
bottle to catch the spilled water.

Then, when they all agreed that the train was going full
speed, they laid the bottle on its side on the window-ledge
of the car, pointing toward the aisle, so that it would make
no difference whether the train was going uphill or
downhill. They left it there five minutes and then took a
reading to find out how much water had been displaced. (They
had marked off a scale in millimeters on the side of the
label.) After half an hour, when the train again was going
full speed, they took another reading. There was time for
only five readings before they went to bed. On the return
trip, they changed tickets to the Pennsylvania and took
five readings under exactly the same conditions. Since five
dollars was riding on the outcome, they all checked every
measure to make sure that there was no mistake and
nothing unfair about the reading.

It turned out that the Central displaced an average of 9
millimeters of water per reading while the Pennsy
displaced 14. This would have been enough for the average
bet, but these were measurement men, so they insisted that
the difference be significant at the 5% level before
the bet would be paid. They quickly performed the
necessary calculations on the back of an envelope and found
that the difference was significant at the 1% level.
Hence there was less than one chance in a hundred that
further readings, no matter how many times repeated,
would finally average out to a verdict of "no 'real'
difference." How did they figure it?

The back of the envelope looked more or less like this:















            Central                            Pennsy

Score   f    d    fd   fd²         Score   f    d    fd   fd²
  12    1    3     3    9            17    1    3     3    9
  11    0    2     0    0            16    0    2     0    0
  10    1    1     1    1            15    1    1     1    1
   9    1    0     0    0            14    1    0     0    0
   8    1   -1    -1    1            13    1   -1    -1    1
   7    0   -2     0    0            12    0   -2     0    0
   6    1   -3    -3    9            11    1   -3    -3    9

      N = 5           Σ = 20               N = 5           Σ = 20

20/5 = 4; √4 = 2, S.D.             20/5 = 4; √4 = 2, S.D.

S.E. = S.D./√(N − 1) = 2/√(5 − 1) = 2/√4 = 2/2 = 1, standard
error of each average.

S.E.diff. = √(1² + 1²) = √2 = 1.4, the standard error of the difference.



Of course, there is quite a lot to explain here, but the
actual operations are as simple as falling off a log. After
each score, you put down how many times it occurred
under f (frequency). Here none of the scores occurred
more than once, and scores of 11 and 7 on the Central, and
of 16 and 12 on the Pennsy, did not occur at all; but we
have entered them with a frequency of 0 to make it clearer
what we are doing in the next column. The numbers in that
column, in both tables, tell how far
away each score is from the middle score. That is why the
column is headed d, standing for "deviations." The middle
score does not deviate at all from itself, so its deviation is
0, and is so entered. You can always fill out the "d" column
quite automatically, simply numbering up and down from
the middle score. The next column is headed "fd," and
what does that suggest from your memories of algebra? It
suggests that you multiply each f by the corresponding d
to get fd; and that is precisely what you do: you multiply
the second column by the third to get the fourth. Then
what does fd² suggest? It suggests that if you multiply the
third column by the fourth, you will get the fifth, since
d × fd = fd². Notice that wherever a zero enters into the
multiplication, the product is zero, and notice that when
you multiply two negative numbers together, as in columns
three and four, the product is positive, as in column five.
You add all those products in column five and write the
sum at the bottom of the column. The rather odd symbol
annexed to it, Σ, is the Greek capital S, and simply means
"sum of." You divide this sum, 20, by the number of
measures, 5, and get 4, the average squared deviation. The
square root of 4 = 2, which is the "standard deviation" of
the scores for both the Central and the Pennsy, computed
by orthodox and standard procedures that you can apply
(with a little practice) to any distribution of test scores.
For practice, you might apply it to the distribution of
scores on page 7. The sum of the squared deviations
(Σfd²) in that case should come out to 1,129. Dividing by
N, 45, you get 25, and the square root of that is 5, the
same as in the shorter Jenkins method.
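The column-by-column procedure can be sketched as follows, using the railroad readings as read from the envelope figures. The function name is invented. Note one assumption spelled out in the comments: measuring deviations from the middle score gives the exact standard deviation only when that middle score is also the average, as it is for both railroads.

```python
import math


def standard_deviation_from_table(frequencies: dict, middle_score: int) -> float:
    """Orthodox procedure from the text: for each score, d is its
    distance from the middle score, fd is f times d, and fd squared
    is d times fd. Sum the last column, divide by N, take the root.
    Exact when middle_score equals the average; otherwise it slightly
    overstates the spread."""
    n = sum(frequencies.values())
    sum_fd2 = 0
    for score, f in frequencies.items():
        d = score - middle_score
        fd = f * d
        sum_fd2 += d * fd
    return math.sqrt(sum_fd2 / n)


# score -> frequency, with the scores that never occurred entered as 0
central = {12: 1, 11: 0, 10: 1, 9: 1, 8: 1, 7: 0, 6: 1}
pennsy = {17: 1, 16: 0, 15: 1, 14: 1, 13: 1, 12: 0, 11: 1}

print(standard_deviation_from_table(central, 9))    # 2.0
print(standard_deviation_from_table(pennsy, 14))    # 2.0
```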

The two lines of figures below the point at which we
found the "standard deviations" of the two railroads should
by now be familiar territory that we have traversed on foot.
It will be good discipline for you to read every symbol in
these two lines and make sure that you know why it is
there. In the first of these lines, beginning S.E., what does
the S.E. stand for? Standard error, of course, as is written
out at the end of the line. But which standard error
is it? The standard error of an average. That
means that we can use the formula: standard deviation of
the measures divided by the square root of one less than
the number of measures (as given above). We have found that
the standard deviation of these measures is 2 (for both
railroads). The number of measures in each case is 5. One
less than this number is 4. The square root of 4 is 2. Hence
the standard error of each average is 2 over 2, which is 1.
See whether you can read all this in the single line of
figures that begins "S.E."

Then, in the last line, S.E.diff. pretty obviously stands
for the standard error of the difference between these two
averages: the square root of the sum of squares of the two
separate standard errors. Since both have a standard error
of 1, the square is also 1, and the sum of the two squares
is 2. The square root of 2 (look it up!) is 1.4. The least that
the two averages can differ, therefore, and have us certify
it as a real difference at the 5% level, is 2.8 points
(millimeters). If they differ by more than 3.6 points (2.6 times
the standard error of the difference), we can certify it as
significant at the 1% level. Since the actual difference
between the two averages was 5 points, it is obviously far
and away beyond the 1% level. There is far less than one
chance in a hundred that the obtained difference between
the two averages was a fluke. Hence the men felt no
compulsion to stay up all night on all subsequent trips
between Chicago and New York, comparing the bumpiness
of the two roads over every mile of roadbed. They had
enough confidence in their statistical theory to realize that
such effort would be wasted. There was considerably less
than one chance in a hundred that any subsequent
measurement of the same sort would ever upset the general
verdict that "the Pennsy is bumpier than the Central
between Chicago and New York."
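The whole back-of-the-envelope test, from standard deviations to verdict, might be retraced like this. The function names are invented for illustration; the figures are the ones quoted in the text (averages of 9 and 14, a standard deviation of 2 for each railroad, five readings apiece).

```python
import math


def standard_error_of_average(standard_deviation: float, n_measures: int) -> float:
    """Divide by the root of N, or of N - 1 when N is under thirty."""
    divisor = n_measures if n_measures >= 30 else n_measures - 1
    return standard_deviation / math.sqrt(divisor)


def quotient_for_averages(average_a, sd_a, n_a, average_b, sd_b, n_b) -> float:
    """Difference between two averages divided by the standard error
    of that difference."""
    se_a = standard_error_of_average(sd_a, n_a)
    se_b = standard_error_of_average(sd_b, n_b)
    se_difference = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(average_a - average_b) / se_difference


quotient = quotient_for_averages(14, 2, 5, 9, 2, 5)
print(round(quotient, 2))   # 3.54
print(quotient >= 2.6)      # True: beyond the 1% level
```

Working with the unrounded square root of 2 rather than 1.4 changes the quotient only in the second decimal place, and the verdict not at all.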

Obviously such a conclusion would make the public-relations
officers of the Pennsy apoplectic with rage, and they
might be tempted to spend fifty thousand dollars building
some kind of go-cart to trail behind their trains in order to
measure bumpiness with greater precision. But the whole
theory of measurement suggests that such an investment
would be unwise. When a difference gets out beyond the
1% level with even crude but fair measures, it is highly
unlikely that refinement of the measures will show a true
difference in the opposite direction.

We are now in a better position to appreciate what a
"standard deviation" is. It is a kind of average of how far
the scores are spread out from the middle score, or mean.
One standard deviation above and one standard deviation
below the mean will enclose two-thirds of the scores if the
distribution is normal. Two above and two below will
enclose 95% of the scores. This sounds exactly like the
standard error, and, in fact, the two have the same basis in
statistical theory. But notice that the standard error
enclosed hypothetical scores: the limits within which scores
might fall by pure chance in the selection of items if the
same student were given an infinite number of parallel
forms (without learning anything or forgetting anything).
The standard deviation encloses the actual scores made by
a given class in any one administration of the test or, in
this case, the actual scores made by two different subjects
in five administrations of the same test.

It is worth remembering that the standard deviation will
usually lie between 10% and 20% of the number of items
in the test, except in mastery tests, in which most students
come close to a perfect score, where it will be smaller. If
you have to make a quick guess, probably the safest guess
for most teacher-made tests (except mastery tests) is that
the standard deviation will be 15% of the number of items
in the test.

Since the actual scores made by a class will ordinarily
spread out farther than the hypothetical scores that any
individual might make on parallel forms, we must expect
the standard deviation to be larger than the standard error
of an individual test score. This is, in fact, what we found
for the distribution of scores printed on page 7. The
standard deviation of these scores was 5 raw-score points; the
standard error of any individual score within this
distribution was approximately 3 raw-score points; the standard
error of a difference between any two of these scores was
4¼ points; and the standard error of the class average on
this test was only .75 of one raw-score point. These figures
will give you an idea of the relative order of size of the
quantities we have been talking about up to this point.



Reliability

Test reliability. We are now in a position to compute the
reliability of objective tests in which all items are given
equal weight. It will take approximately two minutes if
you know the mean and standard deviation. (If the formula
for the standard deviation escapes your memory, you will
find it on page 7, but that is one you ought to learn by
heart.) The reliability of the test depends on just three
quantities: the number of items, the standard deviation,
and the mean (average). If we use n for number of items
(not number of students, remember!), s for the standard
deviation, and M for the mean, the formula for computing
the reliability of a test is the following:

\[
\text{rel.} = 1 - \frac{M(n - M)}{n s^{2}}
\qquad \text{(Kuder-Richardson Formula 21)}
\]

In the scores printed on page 7, the mean was 21, the
number of items was 40, and so the number of items minus
the mean was 19. 21 x 19 = 399. In the denominator, n was
40 and the square of the standard deviation was 25. 40 x 25
= 1,000. Rounding a bit, we get 400 over 1,000 or .4. Then
(do not forget this) we subtract .4 from 1 and get .6 (or
.60, if that looks more familiar) as the reliability of the
test.
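
The same two-minute computation can be sketched in Python. The
function name and argument names are invented for this illustration;
the formula is the simplified one described above, with n for the
number of items, M for the mean, and s for the standard deviation.

```python
def kr21_reliability(n_items, mean, sd):
    """Kuder-Richardson Formula 21 in the simplified form used above:
    1 - M(n - M) / (n * s^2)."""
    numerator = mean * (n_items - mean)   # M times (n - M)
    denominator = n_items * sd ** 2       # n times the squared S.D.
    return 1 - numerator / denominator

# Worked example from the text: 40 items, mean 21, standard deviation 5.
rel = kr21_reliability(n_items=40, mean=21, sd=5)
print(round(rel, 2))  # 399 over 1,000 subtracted from 1, about .60
```

Note that the function does not round 399 up to 400 as the hand
computation does, so it returns .601 rather than exactly .6.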

If even this much computation leaves you cold, you can
find the approximate reliability of most of your tests in
one of the following tables. If the average score on your
test is between 70% and 90% correct, use the first table.
If it is between 50% and 70% correct, use the second table.
Then compute the standard deviation of your test by the
shortcut formula on page 7. If the standard deviation
(labeled S.D. in the tables) is nearest to 10% of the items,
use line 1; if 15%, use line 2; if 20% (which happens very
rarely), use line 3. If you have to guess, use line 2. Then
choose the column that is nearest to the number of items
in your test. The figure at the intersection of this row and
column will be the approximate reliability of your test.

Approximate Reliability of Easy Tests (average 70% to 90% correct)

Number of items (n)    20   30   40   50   60   70   80   90  100

If S.D. is .10n       .21  .45  .62  .69  .75  .78  .81  .83  .85
If S.D. is .15n       .68  [?]  [?]  [?]  .90  .91  .92  .93  .94
If S.D. is .20n       .84  .90  [?]  .94  .95  .96  .96  .97  .97



Approximate Reliability of Hard Tests (average 50% to 70% correct)

Number of items (n)    20   30   40   50   60   70   80   90  100

If S.D. is .10n       [?]  .21  .41  .5?  .61  .66  .71  .74  .77
If S.D. is .15n       .49  .67  .75  .80  .84  .86  .88  .89  .90
If S.D. is .20n       .74  .83  .87  .90  .92  .93  .94  .94  .95



These reliability coefficients are conservative estimates
of the correlation you would get if you administered two
parallel forms of the test so closely together that no
learning took place between them and computed the correlation
between the two sets of scores. In simpler terms, test
reliability is an estimate of how close you would come to the
same set of scores if you gave a parallel form of the test.
It is not a percent and should never be referred to as "a
reliability of 60%" or "60% reliable."

Note the decisive effect of the standard deviation,
because it is in the denominator of the reliability formula
and squared. A large number in the denominator at this
point will make a smaller quantity to be subtracted from
1 and hence leave a larger reliability. The number of items,
n, also in the denominator, has a similar effect. The
location of the mean, M, in the numerator may seem to give
an advantage to easy tests, but this is more than offset by
the fact that such tests generally have a smaller standard
deviation.






We are often asked what level of reliability is satisfactory.
The answer has to be "whatever you can get in a
given field within given time limits." Test publishers have
traditionally not been satisfied with reliabilities less than
.90, but teacher-made tests must usually settle for less.
Over 300 teachers have attended the writer's classes in
measurement, and most of these have produced tests and
tried them out in their own classes. Most of those that the
writer regarded as good, usable tests achieved reliabilities
between .60 and .80. If we wanted a test to be highly
reliable to serve as a final examination, we usually found
that it took two class periods and had to be administered
on two successive days: Part I on Thursday, for example,
and Part II on Friday.

It is good to compute these reliabilities routinely because
they take only about two minutes apiece and flash a
warning signal when the reliability dips so low (as a rough
rule-of-thumb, below .60) that the scores are hardly worth
recording. They will also set you up in the eyes of your
colleagues as a man of science, since one of the few terms
they have heard about is "reliability." They vaguely believe
that it takes vast erudition and possibly an electronic
computer to compute reliability, and they will be greatly
impressed if you can do it in two minutes for any of your
tests on the back of an envelope. Still, you must not let
them go away with the idea that reliability is the only
virtue in a test. The easiest way to achieve it would be
to ask a large number of petty factual questions in a form
that could be answered very rapidly, so that you might get
100 answers from each student within one class period.
They would probably hit a reliability of .90, and since the
brighter and better students would probably get higher
scores than the dull and lazy, the scores might have quite
a respectable correlation with your grades. Still, you would
know, your colleagues would know, and your students
would know that it was a lousy test. The thing to do,
therefore, is to make the best test you can within the time limits
you have available and then compute the reliability. If it is
unsatisfactory, it only means that you need more items to
work up to a stable score; hence make another test. The
following formula will tell you how many times to lengthen
the test to get up to any desired reliability:

\[
\frac{(\text{the reliability you want}) \times (1 - \text{the reliability you got})}
     {(\text{the reliability you got}) \times (1 - \text{the reliability you want})}
\]

If you want .90 and got .60 with your first test, this becomes:

\[
\frac{.90 \times (1 - .60)}{.60 \times (1 - .90)}
= \frac{.90 \times .40}{.60 \times .10}
= \frac{.3600}{.0600}
= 6 \text{ (times longer)}
\]



Thus, it takes 6 tests with a reliability of .60 to work up to
a reliability of .90. Also, it takes 3 tests with a reliability
of .75 to work up to a reliability of .90. Either of these is
entirely feasible if you have the students for a semester or
for a year. Simply make up more tests of the same ability.
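
The lengthening formula is easy to put into a small Python function.
The function and argument names below are invented for this sketch;
the arithmetic is exactly the ratio given above.

```python
def lengthening_factor(reliability_got, reliability_wanted):
    """How many times longer the test must be to reach the wanted
    reliability, given the reliability actually obtained."""
    return (reliability_wanted * (1 - reliability_got)) / (
        reliability_got * (1 - reliability_wanted)
    )

# The two cases mentioned in the text, both aiming at .90.
print(round(lengthening_factor(0.60, 0.90), 2))  # starting from .60
print(round(lengthening_factor(0.75, 0.90), 2))  # starting from .75
```

The first call reproduces the "6 times longer" of the worked example,
and the second the 3 tests needed when you start from .75.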

This formula seems inconsistent with the effect of the
standard deviation (the spread of scores) on reliability,
and to make reliability entirely a function of the number
of items in the test. The supposed inconsistency can be
straightened out as follows. Suppose you have just given a
test on appreciation of Hamlet to your Advanced Placement
Class of superior students, and its reliability with this
class turns out to be .60. That means that if you gave
another test of the same kind to the same class tomorrow,
quite a few students would change position enough to affect
their grade. There are two ways in which you could
increase this reliability. One would be to go across the hall
and administer the same test to a regular, unselected class
that had everybody in it from geniuses to morons. The
reliability over there might well go up to .90, since these
people differed so widely in ability that another test of the
same kind would not shift the rank-order of very many
students. This one test would be sufficient to give that
class reliable grades on Hamlet. "But," you would properly
argue, "I am not responsible for the grades of the class
across the hall. I am responsible for the grades of this
particular class, and I want them to be sufficiently reliable
so that one more test would not shift them in very many
instances." Hence you apply the foregoing formula and
find out that you would have to give six tests of this kind
to this particular class during the unit on Hamlet to get
their scores up to a reliability of .90. The formula applies
only to the sort of group that you have just tested, and it
assumes that the range of ability within this group is not
going to change appreciably during those six tests. For this
reason, the reliability can be predicted on the basis of
number of items alone, assuming that the true standard
deviation within this group is going to remain constant.
We must not forget the lesson of our first section: that
reliability can be increased (per unit of testing time) by
dropping or touching up items that proved to be too hard,
too easy, or non-discriminating. This, also, is not
inconsistent with the formula for lengthening the test. That
formula merely says, "Given the kinds of items you have
now, it will take X times more items to boost reliability to
.90." But if you drop hopeless items and improve others,
the desired reliability may well be attained with fewer
items than the formula predicts.

Correlation

Correlation. This is the other magic word from the art and
mystery of testing. If you can do both reliabilities and
correlations and come up with results within five minutes,
your colleagues will regard you as another Einstein.
Actually, any moderately bright eighth grader who has been
getting B's in arithmetic can learn how to do the simpler
kind of correlation in about fifteen minutes, and it should
not take him longer than five minutes to compute one for
a class of average size.

Here's how to do it: find the percentage of students who
stood in the top half of the group on both measures you
are correlating and look up the correlation (r) corresponding
to this percentage in the following table:



  %     r       %     r       %     r       %     r       %     r

 45   [?]      37   [?]      29   [?]      21  -.25      13   [?]
 44   [?]      36   [?]      28   [?]      20  -.31      12  -.73
 43   .91      35   .60      27   .13      19  -.37      11  -.77
 42   .88      34   .55      26   .07      18  -.43      10  -.81
 41   .85      33   .49      25   .00      17   [?]       9  -.85
 40   .81      32   .43      24  -.07      16  -.55       8  -.88
 39   .77      31   .37      23  -.13      15  -.60       7   [?]
 38   .73      30   .31      22  -.19      14  -.65       6  -.93




These are called "tetrachoric correlations," while the
more common but more difficult kind are called
"product-moment correlations." They mean the same thing, in the
sense that the tetrachoric yields a fairly accurate estimate
of the correlation that you would get by the
product-moment method. Tetrachorics are perfectly respectable and
are often used in educational research, but you can see
that they are not very precise, since a difference of 1%
can make a difference as great as .07 in the correlation.
However, the reliability of the data that teachers usually
have to work with and the relatively small numbers of
students involved usually do not justify more precise
methods of computation. The best you can hope to get by
any method is a rough idea of the general order of
magnitude of the relationship.
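
If the table is not at hand, the same kind of estimate can be sketched
in Python. The booklet does not say how its table was computed, so the
formula below is an assumption: it is the standard relation for two
normally distributed measures, each split exactly at its middle score,
and it agrees with the legible table entries to within about .02. The
function name is invented for this sketch.

```python
import math

def tetrachoric_from_percent(percent_both_top):
    """Estimate r from the percent of students in the top half on both
    measures, assuming both measures are normally distributed and each
    is split exactly at its middle score."""
    p = percent_both_top / 100.0
    return math.sin(2 * math.pi * (p - 0.25))

# A few percentages that also appear in the table.
for pct in (39, 31, 25, 12):
    print(pct, round(tetrachoric_from_percent(pct), 2))
```

Near 25% a change of one percentage point moves this estimate by about
.06, which is the imprecision the text warns about.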

Since even 1% of the students can make so much difference
in the correlation, it is important to use a standard,
uniform method of counting how many students stood in
the top half on each measure. We trust that you know how
to find the middle score on each measure. List the scores
on each measure from highest to lowest and put a tally
after each score for each student who made it. After all
the scores have been tallied, count down the tallies to half
the number of students in the group. The score at which
this middle tally falls is the middle score.

You will ordinarily have the students listed in alphabetical
order, and after each name you will have the two scores
that you are correlating. After you have found the middle
score on each measure, go down the list and put a check
after each score that stands above the middle score on that
measure; a straight line after each score that stands at
the middle score. Do this separately for each of the two
measures.

Then, if you need three more students with middle
scores on Measure A to take in half of the group, put a
check through the first three straight lines on Measure A
that you come to in alphabetical order. If you need five
more students with middle scores on Measure B, put a
check through the first five straight lines after the scores
on that measure. Then count how many students have two
checks after their names. Turn this number into a percent
by dividing it by the total number of students (not by the
number in the top half). Look up this percent in the
foregoing table. The decimal corresponding to it will be the
correlation between the two measures.
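
The counting procedure just described can be sketched in Python as
follows. The function names and the sample scores are invented for
illustration, and one detail is an assumption: with an odd number of
students, "half the group" is rounded down.

```python
def top_half_flags(scores):
    """Mark which students count as 'top half' on one measure.

    scores are in class-list (alphabetical) order. Students above the
    middle score get a check; students at the middle score get a check
    in list order until half the group has been taken in."""
    half = len(scores) // 2  # assumption: round down for odd groups
    middle = sorted(scores, reverse=True)[half - 1]  # where the middle tally falls
    flags = [s > middle for s in scores]
    needed = half - sum(flags)  # middle-score students still required
    for i, s in enumerate(scores):
        if needed == 0:
            break
        if s == middle:
            flags[i] = True
            needed -= 1
    return flags

def percent_top_half_on_both(measure_a, measure_b):
    """Percent of ALL students with a check on both measures."""
    a = top_half_flags(measure_a)
    b = top_half_flags(measure_b)
    both = sum(1 for x, y in zip(a, b) if x and y)
    return 100.0 * both / len(measure_a)

# Hypothetical class of eight, listed alphabetically.
test_scores = [52, 47, 60, 47, 39, 55, 47, 41]
essay_grades = [7, 5, 8, 6, 4, 6, 6, 3]
print(percent_top_half_on_both(test_scores, essay_grades))
```

In this made-up class, three of the eight students end up with two
checks, so the percent to look up in the table is 37.5.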

It is not necessary for the two measures to be on anything
like the same scale. It is perfectly valid, for example,
to correlate height in inches with weight in pounds, or
scores on an objective test that run from 200 to 800 with
scores on an essay that run from 1 to 9. All that is
necessary is to count how many students stood in the top half
of this same group on both measures.

It is impossible and meaningless, however, to correlate
the scores of two different groups on the same measure:
for example, to correlate the scores of the boys with those
of the girls. You start with a single list of names, each of
which has two scores after it. Then you can correlate the
first set of scores with the second set of scores. But if you
have two separate lists of names, each with a single score
after it, there is no way to count how many students who
stood high on the first measure also stood high on the
second. There is only one measure.

Teachers often speak loosely of "correlating" one class
with another when they really mean "comparing." They
use the longer term only because it sounds more scientific
to them; but to anyone who knows what a correlation
means, it is the most flagrant of boners. There is no way
to correlate two groups of students on the same measure;
one can only correlate two sets of measures on the same
students. To compare the performance of two groups of
students on the same test or other measure, you compare
their averages, and if you want to find out whether the
averages were "really" different, you compute the standard
errors of these averages and then the standard error of the
difference, as we explained on pages 7-9.

The general meaning of correlation may be remembered
this way. A positive correlation means that the higher a
student stood on one measure, the higher he stood on the
other. A negative correlation means that the higher he
stood on one measure, the lower he stood on the other.
(We often get such correlations: for example, between
number of errors in a composition and teachers' grades on
those compositions.) A zero or near-zero correlation
(roughly from .25 to -.25) means that a student who stood
high on one measure might stand anywhere at all on the
other (for example, the correlation between height and
I.Q.).

The topic of correlation is closely related to the preceding
topic of reliability, because often the only way of
computing the reliability of a test is to give two tests of the
same ability and correlate the two sets of scores. This is
true of (a) essay tests and (b) tests in which the items
receive different numbers of points. The Kuder-Richardson
Formula 21 given on page 10 will work only for objective
tests in which all items are scored either 1 or 0: that is,
as either right or not-right (wrongs and omits counting
equally as not-right). It is also true (although this principle
is often violated) of tests in which more than 20% of the
students were unable to finish: that is, of speeded tests.
Speed spuriously increases reliability to such an extent that, if
the less able students were able to finish only half the test,
it would be almost impossible to get a low reliability. Yet
sometimes it is appropriate and necessary to give a speeded
test. In such cases, the only fair, acceptable way to
estimate reliability is to give two tests of the same sort and
compute the correlation between the two sets of scores.

Sometimes teachers cheat themselves by securing two
essays, each graded independently, for their final
examination; by correlating grades on the first set of essays with
grades on the second set; and by calling that correlation
the reliability of the examination. It is not: it is the
reliability of one essay. If you use the sum or average of both
essay grades as the grade for the examination, its reliability
is twice the correlation divided by one plus the correlation.
For example, if the correlation is .60:

\[
\frac{2 \times .60}{1 + .60} = \frac{1.20}{1.60} = .75
\]

This is called the "Spearman-Brown Prophecy Formula."
Another form of it appears on page 10. It should also be
used whenever you are computing reliabilities by the old
method of correlating scores on even-numbered items with
scores on odd-numbered items. The correlation you get is
the reliability of half the test. To get the reliability of the
whole test, do as above: double the correlation and divide
by one plus that correlation.
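
The doubling step can be sketched in Python as well. The function name
is invented for this illustration; the arithmetic is the rule just
stated, double the correlation and divide by one plus the correlation.

```python
def spearman_brown_doubled(r_half):
    """Reliability of a test twice as long as the one whose
    reliability (or half-test correlation) is r_half."""
    return 2 * r_half / (1 + r_half)

# Two essays whose grades correlate .60 with each other.
print(round(spearman_brown_doubled(0.60), 2))
```

A correlation of .60 between the two essays, or between odd and even
items, therefore corresponds to a reliability of .75 for the whole
examination.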






